B&B with Local Heaps by nguidotti · Pull Request #1149 · NVIDIA/cuopt

nguidotti · 2026-04-27T11:22:35Z

In this PR, each best-first worker has its own local node heap, so it can push and pop nodes without synchronizing with other workers. Each best-first worker periodically steals a node from a random worker to keep the node distribution roughly balanced across workers. Additionally, each best-first worker has a fixed set of diving workers assigned to it, which dive on its own nodes whenever possible. This essentially eliminates the need for the scheduler thread, freeing one additional thread for useful work.
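To illustrate the idea, here is a minimal sketch of per-worker heaps with random stealing. It is not the code from this PR: `node_t`, `local_heap_t`, and `steal_from_random` are hypothetical names, and a plain mutex per heap is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <random>
#include <vector>

// Hypothetical node type: only the lower bound matters for this sketch.
struct node_t {
  double lower_bound;
  int id;
};

// Orders the heap so that the node with the smallest lower bound is on top.
struct worse_bound {
  bool operator()(const node_t& a, const node_t& b) const { return a.lower_bound > b.lower_bound; }
};

// One heap per best-first worker. The mutex is only contended while a thief steals.
class local_heap_t {
 public:
  void push(node_t node) {
    std::lock_guard<std::mutex> guard(mutex_);
    heap_.push(node);
  }

  // Returns the node with the smallest lower bound, or nothing if the heap is empty.
  std::optional<node_t> pop() {
    std::lock_guard<std::mutex> guard(mutex_);
    if (heap_.empty()) return std::nullopt;
    node_t best = heap_.top();
    heap_.pop();
    return best;
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return heap_.size();
  }

 private:
  mutable std::mutex mutex_;
  std::priority_queue<node_t, std::vector<node_t>, worse_bound> heap_;
};

// Picks a random victim other than `thief` and moves its best node into the thief's heap.
// Returns false when there is no other worker or the chosen victim is empty.
inline bool steal_from_random(std::vector<local_heap_t>& heaps, std::size_t thief, std::mt19937& rng) {
  if (heaps.size() < 2) return false;
  std::uniform_int_distribution<std::size_t> pick(0, heaps.size() - 2);
  std::size_t victim = pick(rng);
  if (victim >= thief) ++victim;  // skip the thief itself
  std::optional<node_t> stolen = heaps[victim].pop();
  if (!stolen) return false;
  heaps[thief].push(*stolen);
  return true;
}
```

In this sketch the stolen node is briefly in neither heap between `pop` and `push`, so a global lower bound computed from the heaps at that moment would miss it. The review discussion in this PR addresses that gap.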

This also implements a compression scheme for vstatus that uses only 2 bits per entry, which reduces its memory consumption by roughly 4x (it previously used an int8_t per entry). Last but not least, this PR replaces std::deque with a fixed-capacity circular_deque_t for the plunge/dive stacks and the idle-worker list.
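As a rough sketch of the 2-bit packing (not the PR's actual implementation): the enum values and the `pack_vstatus` / `unpack_vstatus` helpers below are hypothetical, and the only assumption is that every status fits in the range 0..3.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical status values; the sketch only requires that each one fits in 2 bits (0..3).
enum class vstatus_t : std::uint8_t { basic = 0, at_lower = 1, at_upper = 2, free_var = 3 };

// Packs four 2-bit entries into every byte, lowest bits first.
inline std::vector<std::uint8_t> pack_vstatus(const std::vector<vstatus_t>& vstatus) {
  std::vector<std::uint8_t> packed((vstatus.size() + 3) / 4, 0);
  for (std::size_t i = 0; i < vstatus.size(); ++i) {
    const unsigned shift = static_cast<unsigned>(2 * (i % 4));
    const unsigned bits  = static_cast<unsigned>(vstatus[i]) & 0x3u;
    packed[i / 4] = static_cast<std::uint8_t>(packed[i / 4] | (bits << shift));
  }
  return packed;
}

// Unpacks n entries into a caller-provided buffer, which is resized to n.
inline void unpack_vstatus(const std::vector<std::uint8_t>& packed,
                           std::size_t n,
                           std::vector<vstatus_t>& out) {
  out.resize(n);
  for (std::size_t i = 0; i < n; ++i) {
    const unsigned shift = static_cast<unsigned>(2 * (i % 4));
    out[i] = static_cast<vstatus_t>((packed[i / 4] >> shift) & 0x3u);
  }
}
```

With one byte holding four entries instead of one, the packed buffer is about a quarter of the size of an int8_t array, which matches the roughly 4x reduction quoted above.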

MIPLIB results (GH200, 10min):

================================================================================
main (1, #1099) vs bnb-local-heap (2)
================================================================================

------------------------------------------------------------------------------------------------------------------------------
|                                        |       Run 1        |       Run 2        |     Abs. Diff.     |   Rel. Diff. (%)   |
------------------------------------------------------------------------------------------------------------------------------
| Feasible                                                 227                  228                   +1                 --- |
| Optimal                                                   75                   78                   +3                 --- |
| Solutions with <0.1% primal gap                          124                  130                   +6                 --- |
| Nodes explored (mean)                              4.866e+06            1.436e+07           +9.496e+06                +195 |
| Nodes explored (shifted geomean)                        6772            1.205e+04                +5275               +77.9 |
| Relative MIP gap (mean)                               0.3264               0.3415             +0.01506               +4.62 |
| Relative MIP gap (shifted geomean)                    0.1156               0.1131              -0.0025               -2.16 |
| Solve time (mean)                                      444.6                441.5               -3.054              -0.687 |
| Solve time (shifted geomean)                           221.5                219.1               -2.327               -1.05 |
| Primal gap (mean)                                      11.57                11.15              -0.4201               -3.63 |
| Primal gap (shifted geomean)                          0.6324               0.5604             -0.07203               -11.4 |
| Primal integral (mean)                                 32.63                33.02              +0.3805               +1.17 |
| Primal integral (shifted geomean)                      6.346                6.405             +0.05989              +0.944 |
------------------------------------------------------------------------------------------------------------------------------

In summary, we explored ~3x as many nodes on average (arithmetic mean) in the same time frame. The number of optimal solutions also increased by 3.
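For readers unfamiliar with the "shifted geomean" rows, the sketch below uses the usual definition, exp(mean(log(x + s))) - s. The shift s used to produce the table is not stated in this PR, so it is left as a parameter; both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Shifted geometric mean: exp(mean(log(x + shift))) - shift.
// The shift damps the influence of very small values; its value here is an assumption.
inline double shifted_geomean(const std::vector<double>& values, double shift) {
  if (values.empty()) return 0.0;
  double log_sum = 0.0;
  for (double v : values) log_sum += std::log(v + shift);
  return std::exp(log_sum / static_cast<double>(values.size())) - shift;
}

// Relative difference of run 2 with respect to run 1, in percent.
inline double relative_diff_percent(double run1, double run2) {
  return 100.0 * (run2 - run1) / run1;
}
```

Applied to the table, `relative_diff_percent(4.866e6, 1.436e7)` is about +195, which is where the ~3x figure for the mean node count comes from, while the shifted geomean row gives about +78.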

Checklist

I am familiar with the Contributing Guidelines.
Testing
- New or existing tests cover these changes
- Added tests
- Created an issue to follow-up
- NA
Documentation
- The documentation is up to date with these changes
- Added new documentation
- NA

Remove dependency on rmm::mr::device_memory_resource base class. Resources now satisfy the cuda::mr::resource concept directly.
- Replace shared_ptr<device_memory_resource> with value types and cuda::mr::any_resource<cuda::mr::device_accessible> for type-erased storage
- Replace set_current_device_resource(ptr) with set_current_device_resource_ref
- Replace set_per_device_resource(id, ptr) with set_per_device_resource_ref
- Remove make_owning_wrapper usage
- Remove dynamic_cast on memory resources (no common base class)
- Remove owning_wrapper.hpp and device_memory_resource.hpp includes
- Add missing thrust/iterator/transform_output_iterator.h include (no longer transitively included via CCCL)

nguidotti · 2026-05-28T14:38:10Z

/ok to test 37e757a

nguidotti · 2026-05-28T16:35:04Z

/ok to test b2e5f8c

chris-maes

Thanks for the nice discussion @nguidotti and congratulations on the nice performance improvement with this PR. I'm removing the Request changes so that you can merge when ready.

A couple suggestions:

It sounds like the only reason node_queue_t exposes lock and unlock is to allow a worker to steal a node from another. It would probably be better to add a method like node_queue_t::steal_from_victim(node_queue_t& victim) and handle the locking and unlocking of both queues directly in this method. That would allow you to not expose node_queue_t::lock/unlock, so that people touching the branch and bound code do not need to be concerned about correctly locking and unlocking the node queue. Inside steal_from_victim you can avoid deadlocks by acquiring the locks on the thief and the victim in sorted order according to their worker id (so this might need to be node_queue_t::steal_from_victim(i_t thief_id, i_t victim_id, node_queue_t& victim)).
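A minimal sketch of what such a method could look like, assuming each queue stores its owner's worker id and that ids are distinct. The class layout, `node_t`, and the `worker_id_` member are hypothetical and not taken from the cuOpt sources.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <vector>

struct node_t {
  double lower_bound;
  int id;
};

struct worse_bound {
  bool operator()(const node_t& a, const node_t& b) const { return a.lower_bound > b.lower_bound; }
};

class node_queue_t {
 public:
  explicit node_queue_t(int worker_id) : worker_id_(worker_id) {}

  void push(node_t node) {
    std::lock_guard<std::mutex> guard(mutex_);
    heap_.push(node);
  }

  std::optional<node_t> pop() {
    std::lock_guard<std::mutex> guard(mutex_);
    if (heap_.empty()) return std::nullopt;
    node_t best = heap_.top();
    heap_.pop();
    return best;
  }

  // Moves the victim's best node into this queue while holding both locks,
  // so the node is always in exactly one of the two heaps.
  bool steal_from_victim(node_queue_t& victim) {
    if (&victim == this) return false;
    // Always lock the queue with the smaller worker id first to avoid deadlocks.
    const bool self_first = worker_id_ < victim.worker_id_;
    node_queue_t& first   = self_first ? *this : victim;
    node_queue_t& second  = self_first ? victim : *this;
    std::lock_guard<std::mutex> guard_first(first.mutex_);
    std::lock_guard<std::mutex> guard_second(second.mutex_);
    if (victim.heap_.empty()) return false;
    heap_.push(victim.heap_.top());
    victim.heap_.pop();
    return true;
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return heap_.size();
  }

 private:
  int worker_id_;
  mutable std::mutex mutex_;
  std::priority_queue<node_t, std::vector<node_t>, worse_bound> heap_;
};
```

An alternative to ordering by worker id is `std::scoped_lock lock(mutex_, victim.mutex_);`, which acquires both mutexes with a built-in deadlock-avoidance algorithm. A later commit in this PR mentions simplifying the code with std::scoped_lock.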

Ideally, stealing a node is an atomic operation, so that the node is always either in one queue or another, and thus the node's lower bound is always considered. If you are able to make it an atomic operation you can avoid the need to track the lower bound associated with the node separately (which may be prone to bugs).

Also, if diving needs to copy a node from the node queue, and that cannot happen while stealing, you can add a method node_queue_t::copy_node that acquires mutex_ internally.

Maybe you are able to make the above changes before merging.

Longer term, I think it's worth defining the correct abstractions and data structures to make managing the lower bound simpler. A heap is already the ideal data structure for managing the lower bound, since it inherently takes the lower bound over the nodes it contains. I think we've introduced a lot of book-keeping and other data structures to manage nodes and lower bounds outside the heap. This is likely because we don't have a way to walk nodes in the heap (i.e. the standard C++ data structures only support pushing and popping). If we had a heap where we could walk nodes, I think it would simplify many of the operations within the branch and bound code. Instead of popping a node off the heap and tracking its lower bound separately while solving, we could leave it on the heap and just mark that node as "solve in progress". We would only pop a node from the heap when the solve was completed. When trying to steal a node from a heap or dive from a node, thieves could avoid "solve in progress" nodes. Also, during a plunge we could push child nodes that we are not exploring directly onto the heap, instead of keeping them in a separate stack or circular buffer data structure.

If the code maintained the invariant that all open nodes are in a heap, I think it would be much easier to reason about the correctness of branch and bound.
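One way to picture this longer-term idea is the single-threaded sketch below: a vector-backed heap built on std::push_heap / std::make_heap whose entries can be walked and flagged. All names (`open_node_t`, `walkable_heap_t`, `claim_best_idle`, `complete`) are hypothetical, locking is omitted, and removal is O(n) for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <vector>

struct open_node_t {
  double lower_bound;
  int id;
  bool in_progress;
};

class walkable_heap_t {
 public:
  void push(double lower_bound, int id) {
    nodes_.push_back({lower_bound, id, false});
    std::push_heap(nodes_.begin(), nodes_.end(), worse_bound);
  }

  // Lower bound over all open nodes, including those currently being solved.
  std::optional<double> lower_bound() const {
    if (nodes_.empty()) return std::nullopt;
    return nodes_.front().lower_bound;
  }

  // Walks the heap, marks the best idle node as "solve in progress", and returns its id.
  // The flag does not affect the heap order, which depends only on the lower bound.
  std::optional<int> claim_best_idle() {
    open_node_t* best = nullptr;
    for (open_node_t& node : nodes_) {
      if (node.in_progress) continue;
      if (best == nullptr || node.lower_bound < best->lower_bound) best = &node;
    }
    if (best == nullptr) return std::nullopt;
    best->in_progress = true;
    return best->id;
  }

  // Removes a node once its solve has finished; returns false if the id is unknown.
  bool complete(int id) {
    auto it = std::find_if(nodes_.begin(), nodes_.end(),
                           [id](const open_node_t& node) { return node.id == id; });
    if (it == nodes_.end()) return false;
    *it = nodes_.back();
    nodes_.pop_back();
    std::make_heap(nodes_.begin(), nodes_.end(), worse_bound);  // restore the heap property
    return true;
  }

 private:
  // Min-heap on the lower bound: the smallest bound ends up at the front.
  static bool worse_bound(const open_node_t& a, const open_node_t& b) {
    return a.lower_bound > b.lower_bound;
  }

  std::vector<open_node_t> nodes_;
};
```

Because a claimed node stays in the container until `complete` is called, `lower_bound()` keeps accounting for it without any separate book-keeping, which is the invariant described above.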

nguidotti · 2026-05-29T12:28:14Z

/ok to test d094751

nguidotti · 2026-05-29T13:05:55Z

/ok to test 6f7ab06

nguidotti · 2026-05-29T16:21:41Z

/ok to test d18c1e9

nguidotti · 2026-05-29T18:40:19Z

/merge

bdice and others added 30 commits April 3, 2026 13:51

split worker and worker pool in separated file. code cleanup.

e77dbc2

simplified logic for pseudo cost (and its snapshot) for the regular a…

62d0452

…nd deterministic mode. Signed-off-by: Nicolas Guidotti <224634272+nguidotti@users.noreply.github.com>

fixed compilation

a517f13

Signed-off-by: Nicolas Guidotti <nguidotti@nvidia.com>

added missing header

f31599c

Signed-off-by: Nicolas Guidotti <nguidotti@nvidia.com>

fixed guard against no incumbent when calling guided diving

202738f

Signed-off-by: Nicolas Guidotti <nguidotti@nvidia.com>

addressing code rabbit comments. replaced AT in pseudo_costs_t with a…

4aed76c

… shared_ptr to avoid unnecessary copy. Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

missing dereference

a5c111d

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

Merge branch 'main' into simplify-pseudocost

919e445

split best-first and diving worker into separated objects

76ce1bb

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

increase the wheel size limit

c433e41

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

fixed rng offset

52db538

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

increasing wheel size limit for CUDA 12

3676432

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

first version of the B&B workers with local heaps

d2f6eb7

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

implemented a lock-free stack to track the idle workers. fix potentia…

6a39187

…l crash in work-stealing Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

fixed lower bound calculation at end of the B&B. reverted to locking …

dec671c

…queue for now. refactoring. Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

correctly handles the node in the stack when the solver stops if they…

1b3a282

… are present Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

added atomic in node queue to track size and lower bound without a lock.

e108a54

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

replaced std::deque with a circular buffer.

315aca6

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

Merge remote-tracking branch 'upstream/main' into rmm-cccl-migration

536a692

# Conflicts: # cpp/src/utilities/cuda_helpers.cuh

Inline upstream memory resource variable in test fixture MR composition

31a6eab

Replace deprecated rmm::mr set_*_resource_ref calls with set_*_resource

f889d28

renamed method

3469026

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

Merge branch 'main' into simplify-pseudocost

8e8c794

# Conflicts: # ci/validate_wheel.sh

Merge branch 'main' into simplify-pseudocost

3e6aa83

merging with main branch

e0444c2

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

fixed compilation

f3e863f

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

Merge remote-tracking branch 'upstream/main' into rmm-cccl-migration

76c9ece

fixed small bugs

56bf9ed

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

added cleanup routine for the diving heap

18e1e83

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

nguidotti added 4 commits May 28, 2026 15:08

fix diving worker inactive assert

417c580

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

add additional assert

bbdcb77

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

simplified the bfs worker launch.

c05b964

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

fix non-monotonic objective from the heuristics in the logs

37e757a

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

rg20 removed request for a team and tmckayus May 28, 2026 15:13

chris-maes reviewed May 28, 2026

Comment thread cpp/src/branch_and_bound/branch_and_bound.cpp Outdated

chris-maes reviewed May 28, 2026

Comment thread cpp/src/branch_and_bound/branch_and_bound.cpp

chris-maes reviewed May 28, 2026

Comment thread cpp/src/branch_and_bound/node_queue.hpp Outdated

chris-maes reviewed May 28, 2026

Comment thread cpp/src/branch_and_bound/node_queue.hpp

all push to the queue now uses the lock

b2e5f8c

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

do not re-insert worker back to the pool when it still has nodes

483d200

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

chris-maes approved these changes May 28, 2026

nguidotti added the "do not merge" label (Do not merge if this flag is set) May 28, 2026

nguidotti added 2 commits May 29, 2026 13:23

removed explicit lock/unlock methods from the node queue. simplified …

c4604f0

…logic for launching new bfs workers and work stealing Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

additional fixes for node_queue. pass vstatus as argument when decomp…

d094751

…ressing the packed buffer Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

nguidotti added 2 commits May 29, 2026 15:05

code cleanup

19af0f4

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

Merge branch 'release/26.06' into bnb-local-heap

6f7ab06

nguidotti added 2 commits May 29, 2026 18:20

Merge branch 'release/26.06' into bnb-local-heap

58697da

simplify the code using std::scoped_lock

d18c1e9

Signed-off-by: Nicolas L. Guidotti <nguidotti@nvidia.com>

nguidotti removed the "do not merge" label (Do not merge if this flag is set) May 29, 2026

rapids-bot Bot merged commit a339f1c into NVIDIA:release/26.06 May 29, 2026
99 checks passed

nguidotti deleted the bnb-local-heap branch May 29, 2026 18:40

rapids-bot[bot] merged 117 commits into NVIDIA:release/26.06 from nguidotti:bnb-local-heap

7 participants
